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Abstract 

Background: Technological developments have increased the feasibility of large scale genetic association studies. 
Densely typed genetic markers are obtained using SNP arrays, next-generation sequencing technologies and 
imputation. However, SNPs typed using these methods can be highly correlated due to linkage disequilibrium 
among them, and standard multiple regression techniques fail with these data sets due to their high 
dimensionality and correlation structure. There has been increasing interest in using penalised regression in the 
analysis of high dimensional data. Ridge regression is one such penalised regression technique which does not 
perform variable selection, instead estimating a regression coefficient for each predictor variable. It is therefore 
desirable to obtain an estimate of the significance of each ridge regression coefficient. 

Results: We develop and evaluate a test of significance for ridge regression coefficients. Using simulation studies, 
we demonstrate that the performance of the test is comparable to that of a permutation test, with the advantage 
of a much-reduced computational cost. We introduce the p-value trace, a plot of the negative logarithm of the 
p-values of ridge regression coefficients with increasing shrinkage parameter, which enables the visualisation of the 
change in p-value of the regression coefficients with increasing penalisation. We apply the proposed method to a 
lung cancer case-control data set from EPIC, the European Prospective Investigation into Cancer and Nutrition. 

Conclusions: The proposed test is a useful alternative to a permutation test for the estimation of the significance 
of ridge regression coefficients, at a much-reduced computational cost. The p-value trace is an informative 
graphical tool for evaluating the results of a test of significance of ridge regression coefficients as the shrinkage 
parameter increases, and the proposed test makes its production computationally feasible. 



Background 

Genetic data collected in case-control or cohort studies 
of well-defined disease phenotypes can be used to iden- 
tify genetic variants, typically single nucleotide poly- 
morphisms (SNPs), associated with disease status. In a 
genetic association study, the data consist of genotypes 
and corresponding phenotypes from large numbers of 
individuals with the disease phenotype of interest (cases) 
and disease-free controls. A significant difference in the 
frequency of a genetic variant between the case and con- 
trol groups is taken to be indicative of an association 
between that variant (or, more probably, a nearby causal 
variant correlated with the typed variant) and the pheno- 
type of interest. A framework also exists for analysing a 
continuous phenotypes. Once one or more associated 
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variant have been identified, they can be included in pre- 
dictive models for the estimation of disease risk in indivi- 
duals for whom the (potentially future) disease status is 
unknown. Technological developments, including com- 
mercially available chips for typing millions of genetic 
variants simultaneously, and next-generation sequencing 
technologies such as those used in the 1000 Genomes 
Project [1], are enabling the collection of large amounts 
of genetic data from large numbers of individuals which 
means that analysis of contemporary genetic data sets 
involves the study of high-dimensional data. 

A number of statistical approaches have successfully 
been used to investigate the strength of association 
between genetic variants and a phenotype of interest. 
These methods include testing the significance of the 
association of each variant with the phenotype indepen- 
dently using standard univariate tests (such as the 
Cochran- Armitage test for trend [2] or Fisher's exact 
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test). However, there are disadvantages to relying on uni- 
variate methods for the analysis of genetic data. The data 
from a genome-wide association study typically consists 
of thousands or millions of SNPs, and this large number 
of predictors introduces a multiple testing problem. Uni- 
variate methods therefore require stringent significance 
thresholds due to the large number of tests being under- 
taken to prevent a high false-positive rate [3]. Further, 
univariate methods fail to take into account the com- 
bined effect of multiple SNPs. It is plausible that many 
genetic variants together contribute to the phenotype 
being studied [4], and one advantage of using multivari- 
ate methods is that they allow for the study of the com- 
bined effect of multiple SNPs. Multivariate methods 
allow for the control of confounding variables, as in the 
lung cancer replication study in this paper, where gender, 
smoking status and age were included as unpenalised 
covariates in the model. 

Using multivariate methods, the regression model can 
be extended to include, for example, interaction or higher 
order terms, and in such a case a penalised regression 
approach would be appropriate [5]. Penalised regression 
methods have been applied to genetic data [6,7]. Among a 
number of regression approaches used for prediction in 
high-dimensional data, ridge regression has been shown to 
perform best in terms of prediction error [8]. Ridge regres- 
sion has successfully been used to analyse genetic data 
where SNPs were in high LD [9]; it is the test of signifi- 
cance used by Malo, Libiger & Schork that we evaluate 
here. 

To begin, we consider two regression models com- 
monly used in the analysis of genetic data - the linear 
and the logistic regression models, as follows. 

The standard linear regression model is given by 

Y = Xfi + e (1) 

where Y is a (« x 1) vector of dependent variables, Y it i = 
(1, n) and X is a (« x m) matrix of predictors. /J is a (m 
x 1) vector of regression coefficients fipj = (1, m) and e 
is a (n x 1) vector of normally distributed random errors, 

with S{ ~ jV(o, cr 2 ) • An example would be a model of the 
relationship between a continuous phenotype (such as 
blood pressure or plasma lipid concentration) measured in 
n individuals, and the genotype of these n individuals at m 
SNPs. 

The ordinary least squares estimator for /} is given by 
P = (X'X) _1 X'Y (2) 

The significance of individual OLS regression coeffi- 
cients fij in a multiple regression model can be esti- 
mated using a Wald test. The test statistic is 




where se(Pj) is an estimate of the standard error of the 
/ regression coefficient. Under the null hypothesis 
Ho : Pj = 0 , Tb follows a Student t distribution with n - m 
degrees of freedom. 

Binary outcomes commonly arise in biomedical data 
where they may represent, for example, cases and controls. 
In the logistic regression model, Y is an K-dimensional 
vector of response variables taking values 0 (controls) or 1 
(cases), and X the n x m matrix of explanatory variables, 
as before. For the i individual we denote 

Xi = (Xji, . . . ,Xim) 

The i response Y, is a Bernoulli variable with prob- 
ability of success equal to p t . The logistic regression 
model relates the probability p t that the i observation 
is a case to the predictor variables as 

P ^ = 1 W=ft=7^ (3) 

where /J is a vector of parameters to be estimated. 
The significance of individual logistic regression coeffi- 
cients, Pj , can be estimated using the test statistic 

t A" 

Jo = — 

se{pj) 

where se(Pj) is an estimate of the standard error of 
the f h regression coefficient. Under the null hypothesis 
that /3 y = 0, To asymptotically follows a standard normal 
distribution. 

Genetic data often comprises more predictor variables, 
m, than observations, n. In such a situation, unique maxi- 
mum likelihood estimates of regression parameters do not 
exist. Further, collinearity in the predictors, due to linkage 
disequilibrium (LD) in genetic data, which typically 
increases with the increasing density of available markers, 
results in unstable maximum likelihood estimates of 
regression coefficients. 

An extensive literature exists on the application of 
modified regression techniques to the analysis of high- 
dimensional data. Penalised regression constrains the 
magnitude of the estimated regression coefficients, 
allowing their estimation when ordinary least squares 
(OLS) estimates cannot be obtained. In a Bayesian con- 
text, these techniques are equivalent to the specification 
of a particular prior distribution on the coefficients. For 
example, Lasso regression [10] constrains the sum of 
the absolute value of the regression coefficients to be 
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less than a constant. This is equivalent to imposing a 
double exponential prior centred at zero on the coeffi- 
cients. Lasso regression can estimate some coefficients 
to be exactly zero, permitting dimension reduction in 
the model. Hoggart, Whittaker, De Iorio & Balding [6] 
considered a modified Lasso regression approach for 
the identification of causal SNPs in genome-wide or 
resequenced data, with the aim of identifying regions of 
association whilst considering all SNPs simultaneously. 
When SNPs are in high LD, their method offers 
improvement over both single-SNP analysis and Lasso 
regression in terms of the power to detect causal var- 
iants, and a notable improvement over single-SNP ana- 
lysis in terms of false-positive rate. 

Ridge regression [11] is a another penalised regression 
approach, in which a penalty is applied to the sum of the 
squared parameter estimates. Ridge regression has been 
used in a number of large-scale data analysis scenarios, 
including marker-assisted selection [12], expression data 
analysis [13], and genetic association studies when SNPs 
are in high LD [14]. From a Bayesian viewpoint, ridge 
regression can be considered as standard multiple regres- 
sion with the coefficients estimates having a prior distri- 
bution that is normal with mean zero and known 
variance [15]. In genetic epidemiology, it is desirable to 
estimate the strength of the association between a variant 
and a phenotype. This is problematic when using ridge 
regression which, unlike other penalised regression 
approaches, does not reduce the number of parameters 
in the model, nor estimate the significance of each fitted 
coefficient. 

A test of significance for coefficients estimated using 
ridge regression, based on an approximation of their dis- 
tribution under the null hypothesis, was proposed by 
Halawa & El Bassiouni [16]. The test was originally 
developed and evaluated for data with continuous out- 
comes, when different methods were used to compute 
the shrinkage parameter, A, which controls the degree of 
shrinkage of the regression coefficients and hence their 
distribution under the null hypothesis. Malo, Libiger & 
Schork [9] used the same test in an evaluation of the 
applicability of ridge regression as a means of accommo- 
dating LD in association studies. They used the test in a 
comparison of the performance of ridge regression, mul- 
tiple regression and single-SNP analysis when SNPs are 
in varying degrees of LD. They found that ridge regres- 
sion identified different SNPs as associated with pheno- 
type compared to single-SNP analysis or multiple 
regression. However, they did not consider the perfor- 
mance of the test itself. 

In this paper, we evaluate the performance of a test of 
significance for ridge regression coefficients. Our test is 
based on the test proposed by Halawa & El Bassiouni 
[16]. We extend the test, making it applicable in the 



m > n scenario that is common in contemporary genetic 
data sets. We evaluate the performance of the test in 
simulation studies, using scenarios representative of rea- 
listic high-desnity genetic marker data, considering a 
range of data set dimensions and degrees of shrinkage. 

Ridge regression has also been applied in the logistic 
regression framework [17]. We extend the test proposed 
by Halawa & El Bassiouni [16] to the logistic ridge 
regression model, and again evaluate the test in a range 
of simulation scenarios at different values of A. 

In both linear and logistic ridge regression, we com- 
pare the approximate test of significance to a permuta- 
tion test. We view the permutation test as a benchmark 
as it produces an estimate of the null distribution of the 
parameter estimates. However, the permutation test is 
computationally intensive and becomes more so when 
data are high-dimensional. The test we propose makes it 
feasible to estimate significance with a much lower com- 
putational burden. 

We introduce the /?-value trace, a plot of the negative 
logarithm of the ^-values of the ridge regression coeffi- 
cients with increasing shrinkage parameter. This plot 
enables the visualisation of the relative change in signifi- 
cance of each coefficient, and facilitates the identification 
of predictors most affected by increased penalisation in 
terms of significance. 

We apply the approximate test of significance for logistic 
ridge regression coefficients to a lung cancer data set, 
demonstrating the utility of the test when correlation 
exists among the predictors. 

This paper is organised as follows. We first describe the 
approximate test of significance and the permutation test 
to which it is compared. We then describe the simulation 
studies used in this paper. In the Results section we evalu- 
ate the performance of the proposed test in a range of 
simulation scenarios. Further, we apply the test to a lung 
cancer case control data set. In the Discussion we discuss 
the results and potential applications for the test. 

Methods 

Significance testing in linear ridge regression 

Ridge regression replaces the OLS estimator j} (equa- 
tion (2)) with the ridge regression estimator [11]: 

fi X = (X'X + AI^X'Y ( 4 ) 

Where collinearity exists in X, the OLS estimates of )3 
can be unstable, having large variance. Hoerl & Kennard 
[11] demonstrate that there exists a value of A for which 

the ridge regression estimates have smaller mean 

square error (MSE) than the OLS estimates. Where m > 
n, OLS estimates of /J cannot be obtained because the 
matrix X'X is singular. The addition of the constant A 
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to the diagonal of the XX matrix makes it invertible, so 
ridge regression estimates can be obtained. 

Halawa & El Bassiouni [16] use simulation studies to 
investigate applications of a 'non-exact' i-type test for 
the individual coefficients of a linear regression model 
fitted using ridge regression, based on the i-test above. 
The test statistic is 



P] 



se{pf) 

where /y is the estimate of the / regression coeffi- 
cient under the ridge regression model, and se{ ft> ) is an 
estimate of the standard error. 

Estimates of the standard error of the f element of 
j) X are obtained as the square root of the / element of 
the diagonal of the covariance matrix 

Var (p k ) = ct 2 (X'X + A.I) _1 X'X(X'X + Xiy 1 

In practise, a 2 is replaced by its estimate, given by the 
residual mean square of the ridge model: 



(Y-X/?)'(Y-X/J) 



(5) 



v is the residual effective degrees of freedom. Halawa & 
El Bassiouni [16] use v = n - m. However, when m > n 
this gives a negative estimate of the residual mean square. 
Instead, we use the definition of residual effective degrees 
of freedom given in Hastie & Tibshirani [18], which 
makes use of the "hat matrix", H: 



Y = X(X'X + Xl) J X'Y 



HY 



(6) 



(7) 



H is termed the 'hat matrix', because it 'puts the hat 
on' Y, transforming it to Y. Degrees of freedom for 
error are defined as 



v = n- tr(2H - HH') 



(8) 



In linear regression, the hat matrix reduces to H = X 
(X'X)" 1 X' and n - tr (2H - HH') reduces to n - m. The 
test statistic Tx is assumed to follow a Student t distri- 
bution as in standard multiple linear regression. How- 
ever, the effective number of parameters of the 
penalised regression fit is smaller than m. Hastie and 
Tibshirani define tr (H) as the degrees of freedom taken 
up by the penalised regression fit [18]. tr (H) reduces to 
n - m in ordinary linear regression. Then, Tx is assumed 
to follow a Student t distribution with n - tr (H) degrees 
of freedom. 



In the case of large sample size, as is typically the case 
in genetic data, the distribution of the test statistic is 
asymptotically normal, as noted by Halawa & El Bas- 
siouni [16]. We compared the significance levels of the 
approximate test assuming both a normal and a Student 
^distribution of the test statistic and found that the 
results were substantially identical. Therefore we assume 
that under H 0 , Tx ~ M (0, 1) and use the normal distri- 
bution to test the significance of ridge regression coeffi- 
cients. The results from the corresponding tests 
assuming that under H 0 , Tx ~ t n . tr ( H ) are provided in an 
Additional File. See Additional File 1, Tables SI and S2. 

Significance testing in ridge logistic regression 

Ridge regression has been applied to the logistic regres- 
sion model [17,19]. Cessie & van Houwelingen [17] 
show how ridge regression can be used to improve the 
parameter estimates in logistic regression when the 
number of predictors is relatively large or highly corre- 
lated. They discuss different ways of choosing the 
shrinkage parameter to minimize prediction error. Vago 
& Kennedy [19] apply ridge logistic regression to a clini- 
cal data set. 

In logistic ridge regression, the log-likelihood function 
is penalised with the penalty applied to the L 2 norm of 
fS [19]. Maximum likelihood estimates of /J are obtained 
by maximising the logarithm of the likelihood function 
[19], typically using the Newton-Raphson algorithm. 
The approximate test statistic is 



Standard errors of the coefficient estimates are 
obtained as the square roots of the /* element of the 
diagonal of the covariance matrix. This matrix is esti- 
mated from the final Newton-Raphson iteration: 



Var (k X \ = Var [(X'WX + 2AI) J X'Wzj 



d 2 l 



HP) 



d 2 l 



(dp*)/}' I "'"Knpup' 

= (X'WX + 2X1) _1 (X'WX) (X'WX + 2X1) _1 

where / {pi) is the observed information matrix, i is 
the (penalised) log-likelihood, and W is the weight 
matrix: 

W = diag[p,(l-p i )] 



Pi = 



1 +e x ^ 1 
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z is an n x 1 column vector with elements 
Zi = logit [pi] 



p(i-pO 

Again we assume that under H 0 , T\ ~A/"(0, 1) and 
use the normal distribution to test the significance of 
ridge regression coefficients. 

Permutation test 

To evaluate the proposed test, we compare its perfor- 
mance to that of a permutation test, which we view as a 
benchmark. In the permutation test, to obtain a null dis- 
tribution of the regression coefficients, the elements of 
the outcome vector Y are randomly permuted. The ridge 
regression model is fitted using the permuted observa- 
tions, to obtain ridge regression coefficients. By perform- 
ing 1000 such permutations, a null distribution of the 
regression coefficients is generated. The permutation test 
j?-value is calculated as the proportion of regression coef- 
ficients from the null distribution greater than or equal 
in absolute value to the absolute value of the coefficient 
fitted to the true (non-permuted) data. 

Choice of shrinkage parameter 

Hoerl & Kennard [11] present an existence theorem for 
ridge regression. They demonstrate the existence of a 
value of the shrinkage parameter A in equation (4) 

which will give estimates ft with smaller mean squared 
error than the OLS estimates j$ given in (2). 

However, to date no analytical method to find the 
'best' value of A in terms of minimising MSE has been 
determined. A number of data-driven methods have 
been proposed. These methods aim to determine a 
value of A based on the data that will result in estimates 

of a X with improved mean squared error properties. For 
example, Hoerl, Kennard & Baldwin [20] propose the 
following as an estimate of A, with ft 2 and j} taken 
from the OLS estimates: 



X 



HKB 



PP 



An alternative estimate was suggested by Lawless & 
Wang [21], based on the adoption of the Bayesian per- 
spective mentioned above. Cross-validation based meth- 
ods have also been proposed in the literature [22]. 

Hoerl & Kennard [11] introduce the ridge trace, a plot 

of the estimates p as A increases from zero - see for 

example Figure 1. They propose choosing A correspond- 
ing to the region on the ridge trace at which estimates of 
a no longer change significantly as A increases further. 



rs8034191 

- - rs1 6969968 
- ■ rs402710 
other SNPs 



Shrinkage parameter 

Figure 1 Ridge trace - Lung cancer data. Change in ridge 
regression coefficients with increasing shrinkage parameter. The 
three SNPs that have previously been shown to be associated with 
lung cancer risk, rs8034191, rs1 6969968 and rs402710, are shown in 
bold. Other SNPs are represented by dotted lines. The vertical line is 
at X = 150, where the ridge estimates stabilise. 



Following the ridge trace of Hoerl & Kennard [11], we 
introduce a plot of ^-values of the regression coefficients 
against A as A increases from zero (Figure 2). We refer 
to this plot as a '/7-value trace'. This />-value trace 
enables the visualisation of the change in /^-values of the 
regression coefficients with increasing shrinkage. 

Simulation study 

The proposed test was evaluated using simulated geno- 
type data. FREGENE software [23,24] was used to simu- 
late a population of haplotypes. FREGENE simulates the 
forwards-in-time evolution of sequence-like genetic 
data. The forward-in-time simulation allows demo- 
graphic and selection scenarios to be implemented and 
recombination to be modelled. Haplotypes used in this 
study are taken from a simulation representing the neu- 
tral evolution of 10,500 individuals over 200,000 genera- 
tions, with recombination parameters that realistically 
model recombination in humans. This results in a popu- 
lation of 21,000 haplotypes. The simulated haplotypes, 
together with details of the simulation, are available to 
download from http://www.ebi.ac.uk/projects/BARGEN/ 
[25]. 

Based on these haplotypes, genotypes and correspond- 
ing phenotypes were generated as follows. A range of 
scenarios were considered, comprising n individuals at 
m SNPs. 
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Shrinkage parameter Shrinkage parameter 

Figure 2 p-value traces - Lung cancer data. Change in p-values of ridge regression coefficients in lung cancer data with increasing shrinkage 
parameter, p-values are plotted on the - log scale. The three SNPs that have previously been shown to be associated with lung cancer risk, 
rs8034 191, rs1 6969968 and rs402710, are shown in bold. Other SNPs are represented by dotted lines. The vertical line is at A = 150, where the 
ridge estimates stabilise, a approximate p-values, b permutation test p-values. 



1. Designate one SNP as the causal SNP, selected at 
random from those with population minor allele fre- 
quency in the range 0.10 - 0.15. A subregion of m 
SNPs from the 1 Mb region, containing this causal 
SNP, is used for the genotype simulation. The subre- 
gion is selected at random to be of length m and to 
contain the designated causal SNP. Thus the correla- 
tion structure among the subregion depends on the 
correlation structure of the haplotype region at that 
point. In the resultant sample, correlation with the 
causal SNP ranges from low to perfect (r 2 = 1). 

2. Sample two haplotypes (with replacement) from 
the population of 21,000 haplotypes. Sum the minor 
allele count at each SNP to form a genotype. 

3. Simulate the phenotype for this individual. 
Continuous phenotypes were generated as 

Case-control phenotypes were generated following 
the liability model used by [26]. The penetrance func- 
tion, fa is the probability of being a case, Pr (Yj = 1) 
given having k copies of the minor allele at the causal 
SNP. The genotype relative risk, r, is fa I fa and K is 
the population prevalence. Then, with the population 
frequency of the minor allele of the causal SNP as p, 
under an additive genetic model, fa = K/(l - 2p + 
2pr),fa = rfo and fa = 2r fa - fo- A sample of n/2 cases 
and n/2 controls was generated by generating an indi- 
vidual genotype as described above, then assigning 



the individual to be a case with probability fa and a 
control otherwise. This process is repeated until n/2 
cases and n/2 controls are obtained. 
4. Record the minor allele count (0, 1, 2) at the m 
SNPs for the i th individual, giving rise to an n x m 
matrix of minor allele counts. 

Ridge regression coefficients were fitted to data with 
continuous outcomes using lm. ridge from the pack- 
age MASS in R [27] for both the simulated data and the 
permutation test. 

Estimates of regression coefficients under logistic ridge 
regression models were computed using the Newton- 
Raphson algorithm. 

In the case of both continuous and binary outcomes, 
SNPs that were invariant in the sampled genotypes were 
removed from the data, and genotypes were standar- 
dised, prior to analysis. 

The two tests were evaluated using the true positive rate 
(TPR) and false positive rate (FPR), averaged over all the 
replicates for each simulation scenario. We define TPR to 
be the proportion of causal SNPs, as designated in the 
data simulation, significantly associated with phenotype at 
the nominal threshold a = 0.05. TPR is not reported for 
the null simulations, as there is no causal SNP associated 
with phenotype in these data. We define FPR to be the 
proportion of non-causal SNPs significantly associated 
with phenotype at the same significance threshold. 
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Results and Discussion 
Null Simulation 

Genotypes and corresponding phenotypes were gener- 
ated in two different sized data sets: (1) n = 500, m = 20 
and (2) n = 1000, m = 1000. In generating the null data, 
no SNP was designated the causal SNP. Continuous 
phenotypes were generated as F; ~ N(0, 1); binary phe- 
notypes were generated as Y t ~ Binom(l, 0.5). False 
positive rates are reported at the nominal significance 
threshold a = 0.05. Four values of the ridge parameter A 
were used. 

Results of the null simulations are shown in Table 1. 
We find that approximate test gives similar results to 
the permutation test in terms of false positive rate, espe- 
cially in the case of continuous outcomes. 

Continuous Phenotypes 

Genotypes and corresponding phenotypes were gener- 
ated as described above, for a range of data sets dimen- 
sions: n = 500, 1000, 5000 and m = 20, 100, 1000 and all 
SNPs in the 1 Mb region (approximately 10,000 SNPs). 
Phenotypes were generated as Y{ ~ N[\ + 2k, 1) with k 
being the minor allele count at the causal SNP. A range 
of values of the shrinkage parameter A were used: A = 
0.1, 1, 10, 100. In Figure 3, the left column shows the null 
distributions, generated in a permutation test, used to 
estimate the significance of a ridge regression coefficient 
for a significant SNP (top row) and for a SNP that is not 
associated with phenotype (bottom row). The coefficient 
fitted to the original data is indicated. In the right col- 
umn, the null distribution of the test statistic used in the 
approximate test is shown, with the test statistic of the 
fitted coefficient indicated. Ridge regression models were 
fitted using the shrinkage parameter A = 1. These results 
are examples taken from a single simulation, and above 
each plot the /?-value according to the permutation test 
(left) or the approximate test (right) is shown. 

Table 2 compares the performance of the approximate 
and permutation tests in different simulation scenarios 
and at different values of the shrinkage parameter. We 
see that the approximate test performs well compared 



to the permutation test in terms of power (true positive 
rate) and that it has a slightly higher false positive rate. 

When ranking the SNPs in order of significance, the 
approximate test and the permutation test ranked the 
SNPs identically or nearly so (results not shown). 

Figure 4 shows a Bland-Altman plot [28] of difference 
(permutation test j?-value - z-type test p-value) against 
mean for the j5-values of 1000 SNPs in 5000 individuals, 
/^-values are plotted on the -log scale. We see that the 
bias is towards smaller j?-values from the approximate 
test, which is congruous with the higher false positive 
rate for the approximate test shown in Table 2. 

Continuous Phenotypes with Multiple Causal SNPs 

For complex diseases, multiple causal SNPs are likely to 
affect the phenotype. We investigated the performance of 
the test when more than one SNP is associated with phe- 
notype. We simulated data from two different scenarios: 
n = 500, m = 100 and n = 500, m = 1000. In each region 
of m simulated genotypes, ten SNPs with minor allele fre- 
quency 0.10 - 0.15 were designated causal and given 
effect size 1; the non-causal SNPs had effect size 0. Phe- 
notypes were simulated as Y = X/J + e, e ~ A/"(0, a 2 l), 
O 2 = 1 where /J is the vector of effect sizes. 

Results are presented in Table 3. We see that the con- 
clusions drawn about the test, of adequate power at the 
cost of a slightly higher false positive rate, are equally 
valid when multiple SNPs in the data are associated 
with phenotype. 

Computational performance comparison 

Using an example simulation, we compared the computa- 
tional time required to compute the approximate and the 
permutation tests. A data set with dimensions n = 1000, m 
= 1000 and A = 1 was used. Approximate test and permu- 
tation test p - values were computed and the time taken to 
arrive at the /J-values was recorded. Calculations were 
done using R version 2.12.0 [27] on an iMac running Mac 
OS X Version 10.6.7, fitted with an 2.8 Ghz Intel Core 17 
processor and 16 GB 1067 MHz DDr3 RAM. Computa- 
tional times are compared in Table 4. We see that the 



Table 1 Performance comparison in null simulation 
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False positive rates at the nominal significance threshold a = 0.05 in null datasets. In each scenario, results are averaged over ten replicates 
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Figure 3 Null distributions of the regression coefficients and tests statistics in the ridge regression model for data with continuous 
outcomes. Left column - histogram of fitted coefficients obtained from the permutation test, with the coefficient fitted to the original data 
indicated by a vertical line. Right column - Null distribution of the test statistic used in the approximate test (a standard normal distribution), 
with the test statistic of the fitted coefficient indicated by a vertical line. Top row, a SNP associated with phenotype; bottom row, a SNP not 
associated with phenotype. The causal and non-causal SNPs are from the same replicate in the simulation study where n = 500, m = 20, X = 1. 
p-values are shown above each plot. 



permutation test takes approximately 500 times longer to 
compute than the approximate test. 

Binary phenotypes 

Genotypes and corresponding binary phenotypes were 
generated for nine different data set dimensions: n = 
500, 5000 and m = 20, 100, 1000, 2000, and n = 500, 



m = all SNPs in the 1 Mb region (approximately 10,000 
SNPs). The genotype relative risk, r, was specified as 2. 
For the largest data dimensions (« = 500, m ~ 10, 000), 
results are not shown for the permutation test, due to 
the computational time required to fit ridge regression 
models on data of this size. Similarly, results are not 
shown when n = 5000 and m ~ 10, 000. 
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Table 2 Performance comparison in simulated data with continuous outcomes 
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Performance comparison between a permutation test and the approximate test. Data are simulated genotype data with continuous phenotypes. Reported are 
proportion of true positive and false positive results at significance threshold a = 0.05. TPR = True Positive Rate, FPR = False positive rate. Results for each 
simulation scenario are averaged over ten replicates. 



In Figure 5, the left column shows the null distributions, 
generated in a permutation test, used to estimate the sig- 
nificance of a ridge regression coefficient for a significant 
SNP (top row) and for a SNP that is not associated with 
(bottom row) in data with binary outcomes. The coeffi- 
cient fitted to the original data is indicated. In the right 
column, the null distribution of the test statistic used in 
the approximate test is shown, with the test statistic of the 
fitted coefficient indicated. Ridge regression models were 



fitted using the shrinkage parameter A = 1. These results 
are examples taken from a single simulation, and above 
each plot the p-value according to the permutation test 
(left) or the approximate test (right) is shown. 

Figure 6 compares the ranking of the SNPs from most 
significant (rank = 1) to least significant. Only twelve 
SNPs are shown because the SNPs that were invariant in 
the data were removed before analysis. The SNPs were 
ranked according to both the approximate test and the 
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Figure 4 Bland-Altman plot of mean versus difference for 
p-values computed using both tests, p-values were computed 
using both the approximate test and a the permutation test for 
data with continuous outcomes, and are plotted on the -log scale. 
n = 5000, m = 1000, A = 0.1. 



Table 4 Comparison of computational performance 



permutation test. From Figure 6 we see that whilst the 
ranking of the SNPs was not identical, the most strongly 
associated SNPs are ranked as such by both tests. 

Table 5 compares the performance of the approximate 
test and a permutation test of significance for different 
sized data sets and at different values of the shrinkage para- 
meter: A = 0.1, 1, 10, 100. For most data set dimensions and 
values of A, the z-type test is more conservative than the 
permutation test, with lower true positive and false positive 
rates. This is in contrast to the linear regression case. 

Comparison with univariate tests of significance 

The performance of tests of significance of ridge regres- 
sion coefficients, in terms of true and false positive 

Table 3 Multiple causal phenotypes 



Approximate test 



Permutation Test 



0.1 



1 



10 100 0.1 



10 100 



SNPs 



100 TPR 0.624 0.717 0.939 1.000 0.252 0.312 0.517 0.910 

FPR 0.064 0.061 0.091 0.250 0.001 0.001 0.004 0.078 

1000 TPR 0.210 0.250 0.670 0.970 0.020 0.070 0.170 0.770 

FPR 0.074 0.058 0.060 0.100 0.000 0.000 0.001 0.011 

Performance comparison between a permutation test and the approximate 
test in simulated genotype data with continuous phenotypes and multiple 
causal SNPs. Reported are proportion of true positive and false positive results 
at significance threshold a = 0.05. n = 500 and in each scenario ten SNPs are 
with MAF 0.10 - 0.15 are designated causal with effect size 1; the rest of the 
SNPs have 0 effect size. TPR = True Positive Rate, FPR = False positive rate. 
Results for each simulation scenario are averaged over ten replicates. 





Approximate test 


Permutation test 


time (seconds) 


1.936 


1043.604 



Comparison of computation performance of approximate test and 
permutation test, n = 1000, m = 1000 and X — 1. Approximate test and 
permutation test p - values were computed and the time taken to arrive at 
the p-values was recorded. Calculations were done using R version 2.12.0 [27] 
on an iMac running Mac OS X Version 10.6.7, fitted with an 2.8 Ghz Intel Core 
i7 processor and 16 GB 1067 MHz DDr3 RAM. 

rates, was compared to the performance of univariate 
tests. Comparisons were made in each of the simulation 
settings: the null simulation, and the simulations with 
continuous and with binary outcomes. The results are 
shown in Additional File 1, tables S3, S4 and S5. As 
would be expected when using a penalised regression 
approach such as ridge regression, the performance of 
the corresponding significance test depends on the 
degree of shrinkage, which is controlled by the shrink- 
age parameter A. The performance of the approximate 
test is comparable to that of a permutation test, with 
the advantage of a much-reduced computational burden. 
Further, ridge regression has the advantage over univari- 
ate tests of significance that it results in a much lower 
false positive rate. These advantages of ridge regression 
compared to univariate methods are further illustrated 
in the study of lung cancer data which follows. 

Lung cancer data 

Genome-wide association studies have identified SNPs 
associated with lung cancer disease status. SNPs have 
been identified at chromosomal locations 15q25 [29,30], 
5pl5 [31] and 6p21 [32]. The associations at 15q25 and 
5pl5 have been replicated in white populations, but the 
association at 6p21 has not [33]. Not all studies success- 
fully replicated the associations at 15q25 and 5pl5 [34]. 

Here, we use ridge regression to re-evaluate a set of 35 
SNPs for association with lung cancer disease status. 
Whilst these data are not as high-dimensional as those 
from a genome-wide study, they allow us to illustrate the 
features of using ridge regression for genetic data. We 
show that ridge regression is a useful technique when 
data are correlated, and illustrate that multivariate meth- 
ods have advantages over univariate tests of significance. 

Data consist of genotypes and non-genetic predictors 
from approximately 4000 individuals in the European 
Prospective Investigation into Cancer and Nutrition 
(EPIC, [35]). Missing genotypes were imputed using 
mean imputation. Gender, smoking status and age were 
included as unconstrained parameters in the model. 

For the purpose of comparison, univariate (SNP-by- 
SNP) j?-values were calculated. Univariate jj-values were 
calculated by fitting a logistic regression model for each 
SNP independently, with gender, smoking status and 
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coefficient estimate T\ 

Figure 5 Null distributions of the regression coefficients and tests statistics in the ridge regression model for data with binary 
outcomes. Left column - histogram of fitted coefficients in the permutation test, with the coefficient fitted to the original data indicated by a 
vertical line. Right column - Null distribution of the test statistic used in the approximate test (a standard normal distribution) with the test 
statistic of the fitted coefficient indicated by a vertical line. Top row, a SNP associated with phenotype; bottom row, a SNP not associated with 
phenotype. The causal and non-causal SNPs are from the same replicate in the simulation study where n = 500, m = 20, A = 1. p-values are 
shown above each plot. 



age included in the model. A Wald test was then used 
to estimate the significance of the coefficient. 

In this example, m < n and (unpenalised) multivariate 
regression does give rise to unique parameter estimates. 
Multivariate /^-values (equivalent to ^-values from the z- 
type test for ridge regression coefficients with a ridge 
penalty of zero) are also reported here. 

In a data set of this size, it is computationally feasible to 
fit a ridge regression model with a range of values of A. 
The ridge trace (Figure 1) is a plot of parameter estimates 
against A [11]. The ridge trace typically suggests a range 
of values of A rather than a single best value. In Figure 1, 



A = 150 seems to be the point at which the ridge esti- 
mates stabilise, and Table 6 reports the corresponding 
approximate j5-values. 

Figure 1 shows the ridge trace, and Figure 2 shows a plot 
of /^-values with increasing A. Due to low LD between 
most of the SNPs, most coefficient estimates do not change 
significantly with increasing A and nor do estimates of their 
significance. SNPs rs8034191 and rsl6969968 are both 
located at 15q25 and are in high LD (r 2 = 0.961 in HapMap 
CEU population, r 2 = 0.81 in our data). In contrast to most 
of the SNPs, coefficient estimates for these two SNPs do 
change rapidly with change in the shrinkage parameter. 
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o approximage test p-value 
x permutation test p-value 



SNP 

Figure 6 Comparison of the ranking of SNPs SNPs are ranked by 
p-value, from 1 (most significant) to least significant, by both the 
approximate test and the permutation test. Data are genotypes with 
binary outcomes. The SNP name marked with an asterisk * is the 
causal SNP, and is the most significant SNP according to both tests. 
n = 500, m = 20, X = 1. Only twelve SNPs are shown because the 
SNPs that were invariant among the sampled individuals were 
removed from the data before analysis. 



Further, with increasing shrinkage and stabilisation of 
the estimates, the approximate /"-values for these SNPs 
become significant (Figure 2). rs402710, which is not in 
LD with other SNPs in the data, is significant in a multi- 
ple logistic regression model even when no penalty is 
included in the model (A = 0). A further SNP, rs671330, 
in chromosome 6, has a nominally significant /7-value in 
the approximate test (ranging from 0.048 to 0.040 with 
increased shrinkage), but again this SNP has not pre- 
viously been shown to be associated with lung cancer 
risk. 

The p-value trace using permutation test ^-values 
(Figure 2b) shows good agreement with the approximate 
p-value trace (Figure 2a). Figure 2b is much more com- 
putationally expensive to produce than Figure 2a, thus 
the use of the approximate test makes the plotting of a 
p-value trace for a range of values of X more feasible. 

Table 6 presents the univariate, multivariate, approxi- 
mate and permutation test p-values at A = 150 for four 
SNPs from the regions which have previously been 
shown to be associated with lung cancer disease status. 
Using ridge regression, we replicate the previously found 
associations at 15q25 and 5pl5, but fail to replicate the 
association at 6p21. Using the univariate test, a further 
SNP, rs6746834 (on chromosome 2) was nominally sig- 
nificant at the 0.05 level (p = 0.049), but association at 
this region has not previously been shown. This SNP 



was not found to be significantly associated with lung 
cancer disease status by the approximate test. We inter- 
pret this as a false positive that arises when univariate 
tests are used. 

The results in Table 6 demonstrate the advantage of 
multivariate tests, and specifically of ridge regression, 
over univariate tests of significance. SNP rs402710, 
which has previously been shown to be associated with 
lung cancer disease status [31], was not found to be sig- 
nificant using the univariate test but was found as such 
by the multivariate methods. The two SNPs that are 
correlated, rs8034191 and rsl6969968, were not signifi- 
cant in multiple regression but were significant in ridge 
regression, demonstrating the advantage of using ridge 
regression when SNPs are correlated. 

These results demonstrate that this approximate test of 
significance for coefficients fitted using logistic ridge 
regression reproduces previously ascertained associations, 
at reduced computational cost compared to a permutation 
test, even when SNPs are highly correlated. 

Conclusions 

We present and evaluate the performance of a test of 
significance for coefficients estimated using ridge regres- 
sion. We evaluate the test as applied to both linear and 
logistic ridge regression models. Our evaluation is by 
means of simulation studies across a range of scenarios 
representative of genetic data. We evaluate the test by 
comparing its performance to that of a permutation 
test. 

We evaluate the performance of the test when it is 
applied to a real data set. The data set comprises lifestyle 
data and genotypes together with lung cancer case-control 
status. Using the proposed test, we successfully replicate 
previously found associations at much reduced computa- 
tional cost compared to a permutation test. This demon- 
strates the utility of the test for detecting significant 
variables when predictor variables are highly correlated, as 
were two significant SNPs in the lung cancer data. 

Stability selection [36] is a method for variable selec- 
tion that has received attention in the literature in recent 
years. It combines subsampling of the data with a dimen- 
sion reduction technique, with the aim of finding consis- 
tently significant variables. Penalised regression methods 
that perform variable selection, such as the Lasso [10] 
and the Elastic Net [37], have been used in stability selec- 
tion. To use ridge regression with stability selection, a 
way of determining which variables to select in each sub- 
sample of the data is required. A permutation test would 
be computationally expensive in this context, rapidly 
becoming infeasible if for large n, if the guidelines of 100 
subsamples of size nl2 given by Meinshausen & Biihl- 
mann [36] were followed. The test of significance pro- 
posed here, being much less computationally expensive 
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Table 5 Performance comparison in simulated data with binary outcomes 
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Performance comparison between a permutation test and the approximate test in simulated genotype data with case-control phenotypes. Reported are 
proportion of true positive and false positive results at significance threshold a = 0.05. TPR = True Positive Rate, FPR = False positive rate. Results for each 
simulation scenario are averaged over ten replicates. 



than a permutation test, makes the combination of ridge 
regression and stability selection a feasible possibility. 

The choice of shrinkage parameter in ridge regression is 
discussed in the literature, but no consensus method pro- 
vides an universally optimum choice. The proposed test 
performs well over a range of values of A. We introduce 
the /"-value trace, a plot of the change in the negative 



logarithm of the p-value of the regression coefficients as 
the shrinkage parameter increases from zero. This trace 
can be used in combination with the ridge trace of [11] to 
visualise how the significance of the regression parameters, 
as well as their value, changes with increasing shrinkage. 
Such a plot could aid the identification of significant coef- 
ficients in the ridge model. 



Table 6 Lung cancer data 



SNP (chromosome) 


Univariate p-value 


Multivariate p-value unpenalised 


Approximate p-value 
A = 150 


Permutation p-value 
X = 150 


rs8034191 (15q25) 


0.009 


0.603 


0.007 


0.006 


rs1 6969968 (15q25) 


0.001 


0.183 


0.001 


0.000 


rs402710 (5p15) 


0.213 


0.012 


0.011 


0.004 


rs4324798 (6p21) 


0.513 


0.231 


0.248 


0.251 



Of the 35 SNPs in the lung cancer data set, four have been previously reported to be associated with lung cancer disease status. For these four SNPs, this table 
reports univariate, multivariate, approximate and permutation test p-values. 
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